1.
Bioinform Adv ; 3(1): vbad180, 2023.
Article in English | MEDLINE | ID: mdl-38130879

ABSTRACT

Motivation: There now exist thousands of molecular biology databases covering every aspect of biological data. This database infrastructure takes significant effort and funding to develop and maintain. The creators of these databases need to make strong justifications to funders to prove their impact or importance. There are many publication metrics and tools available, such as Google Scholar to measure citation impact or AltMetrics, which covers multiple measures including social media coverage. Results: In this article, we describe a series of novel impact metrics that have been applied initially to the UniProt database and are now made available via a Google Colab to enable any molecular biology resource to gain several additional metrics. These metrics, powered by freely available APIs from Europe PubMed Central and SureChEMBL, cover mentions of the resource in full-text articles, including which section of the paper the mention occurs in, grant acknowledgements and mentions in patent applications. This tool, which we call MBDBMetrics, is a useful adjunct to existing tools. Availability and implementation: The MBDBMetrics tool is available at the following locations: https://colab.research.google.com/drive/1aEmSQR9DGQIZmHAIuQV9mLv7Mw9Ppkin and https://github.com/g-insana/MBDBMetrics.

2.
Nature ; 622(7983): 646-653, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37704037

ABSTRACT

We are now entering a new era in protein sequence and structure annotation, with hundreds of millions of predicted protein structures made available through the AlphaFold database. These models cover nearly all proteins that are known, including those challenging to annotate for function or putative biological role using standard homology-based approaches. In this study, we examine the extent to which the AlphaFold database has structurally illuminated this 'dark matter' of the natural protein universe at high predicted accuracy. We further describe the protein diversity that these models cover as an annotated interactive sequence similarity network, accessible at https://uniprot3d.org/atlas/AFDB90v4. By searching for novelties from sequence, structure and semantic perspectives, we uncovered the β-flower fold, added several protein families to the Pfam database and experimentally demonstrated that one of these belongs to a new superfamily of translation-targeting toxin-antitoxin systems, TumE-TumA. This work underscores the value of large-scale efforts in identifying, annotating and prioritizing new protein families. By leveraging the recent deep learning revolution in protein bioinformatics, we can now shed light on uncharted areas of the protein universe at an unprecedented scale, paving the way to innovations in life sciences and biotechnology.


Subjects
Databases, Protein , Deep Learning , Molecular Sequence Annotation , Protein Folding , Proteins , Structural Homology, Protein , Amino Acid Sequence , Internet , Proteins/chemistry , Proteins/classification , Proteins/metabolism
3.
Nucleic Acids Res ; 51(18): 9522-9532, 2023 Oct 13.
Article in English | MEDLINE | ID: mdl-37702120

ABSTRACT

The protein structure prediction problem has been solved for many types of proteins by AlphaFold. Recently, there has been considerable excitement to build off the success of AlphaFold and predict the 3D structures of RNAs. RNA prediction methods use a variety of techniques, from physics-based to machine learning approaches. We believe that there are challenges preventing the successful development of deep learning-based methods like AlphaFold for RNA in the short term. Broadly speaking, the challenges are the limited number of structures and alignments making data-hungry deep learning methods unlikely to succeed. Additionally, there are several issues with the existing structure and sequence data, as they are often of insufficient quality, highly biased and missing key information. Here, we discuss these challenges in detail and suggest some steps to remedy the situation. We believe that it is possible to create an accurate RNA structure prediction method, but it will require solving several data quality and volume issues, usage of data beyond simple sequence alignments, or the development of new less data-hungry machine learning methods.

4.
PLoS One ; 18(9): e0290890, 2023.
Article in English | MEDLINE | ID: mdl-37729217

ABSTRACT

Protein regions consisting of arrays of tandem repeats are known to bind other molecular partners, including nucleic acid molecules. Although the interactions between repeat proteins and DNA are already widely explored, studies characterising tandem repeat RNA-binding proteins are lacking. We performed a large-scale analysis of human proteins devoted to expanding the knowledge about tandem repeat proteins experimentally reported as RNA-binding molecules. This work is timely because of the release of a full set of accurate structural models for the human proteome amenable to repeat detection using structural methods. The main goal of our analysis was to build a comprehensive set of human RNA-binding proteins that contain repeats at the sequence or structure level. Our results showed that the combination of sequence and structural methods finds significantly more tandem repeat proteins than either method alone. We identified 219 tandem repeat proteins that bind RNA molecules and characterised the overlap between repeat regions and RNA-binding regions as a first step towards assessing their functional relationship. We observed differences in the characteristics of repeat regions predicted by sequence-based or structure-based methods in terms of their sequence composition, their functions and their protein domains.


Subjects
Knowledge , RNA-Binding Proteins , Humans , Models, Structural , RNA-Binding Proteins/genetics , Tandem Repeat Sequences/genetics , RNA/genetics
5.
J Struct Biol ; 215(4): 108023, 2023 12.
Article in English | MEDLINE | ID: mdl-37652396

ABSTRACT

Tandem Repeat Proteins (TRPs) are a class of proteins with repetitive amino acid sequences that have been studied extensively for over two decades. Different features at the level of sequence, structure, function and evolution have been attributed to them by various authors. Yet many of their salient features appear only when looking at specific subclasses of protein tandem repeats. Here, we attempt to rationalize the existing knowledge on TRPs by pointing out several dichotomies. The emerging picture is more nuanced than generally assumed and allows us to draw some boundaries of what is not a "proper" TRP. We conclude with an operational definition of a specific subset, which we call STRPs (Structural Tandem Repeat Proteins), that separates a subclass of tandem repeats with distinctive features from several other less well-defined types of repeats. We believe that this definition will help researchers in the field to better characterize the biological meaning of this large yet largely understudied group of proteins.


Subjects
Proteins , Tandem Repeat Sequences , Proteins/genetics , Proteins/chemistry , Tandem Repeat Sequences/genetics , Amino Acid Sequence
6.
Bioinform Adv ; 3(1): vbad064, 2023.
Article in English | MEDLINE | ID: mdl-37359723

ABSTRACT

Motivation: The visualization of biological data is a fundamental technique that enables researchers to understand and explain biology. Some of these visualizations have become iconic, for instance: tree views for taxonomy, cartoon rendering of 3D protein structures or tracks to represent features in a gene or protein, for instance in a genome browser. Nightingale provides visualizations in the context of proteins and protein features. Results: Nightingale is a library of re-usable data visualization web components that are currently used by UniProt and InterPro, among other projects. The components can be used to display protein sequence features, variants, interaction data, 3D structure, etc. These components are flexible, allowing users to easily view multiple data sources within the same context, as well as compose these components to create a customized view. Availability and implementation: Nightingale examples and documentation are freely available at https://ebi-webcomponents.github.io/nightingale/. It is distributed under the MIT license, and its source code can be found at https://github.com/ebi-webcomponents/nightingale.

7.
Proteins ; 91(8): 1007-1020, 2023 08.
Article in English | MEDLINE | ID: mdl-36912614

ABSTRACT

Bacterial fibrillar adhesins are specialized extracellular polypeptides that promote the attachment of bacteria to the surfaces of other cells or materials. Adhesin-mediated interactions are critical for the establishment and persistence of stable bacterial populations within diverse environmental niches and are important determinants of virulence. The fibronectin (Fn)-binding fibrillar adhesin CshA, and its paralogue CshB, play important roles in host colonization by the oral commensal and opportunistic pathogen Streptococcus gordonii. As paralogues are often catalysts for functional diversification, we have probed the early stages of structural and functional divergence in Csh proteins by determining the X-ray crystal structure of the CshB adhesive domain NR2 and characterizing its Fn-binding properties in vitro. Despite sharing a common fold, CshB_NR2 displays an ~1.7-fold reduction in Fn-binding affinity relative to CshA_NR2. This correlates with reduced electrostatic charge in the Fn-binding cleft. Complementary bioinformatic studies reveal that homologues of CshA/B_NR2 domains are widely distributed in both Gram-positive and Gram-negative bacteria, where they are found housed within functionally cryptic multi-domain polypeptides. Our findings are consistent with the classification of Csh adhesins and their relatives as members of the recently defined polymer adhesin domain (PAD) family of bacterial proteins.


Subjects
Anti-Bacterial Agents , Membrane Proteins , Ligands , Membrane Proteins/chemistry , Gram-Negative Bacteria/metabolism , Gram-Positive Bacteria/metabolism , Adhesins, Bacterial/genetics , Adhesins, Bacterial/chemistry , Adhesins, Bacterial/metabolism , Bacterial Proteins/chemistry
8.
Nucleic Acids Res ; 51(D1): D9-D17, 2023 01 06.
Article in English | MEDLINE | ID: mdl-36477213

ABSTRACT

The European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) is one of the world's leading sources of public biomolecular data. Based at the Wellcome Genome Campus in Hinxton, UK, EMBL-EBI is one of six sites of the European Molecular Biology Laboratory (EMBL), Europe's only intergovernmental life sciences organisation. This overview summarises the status of services that EMBL-EBI data resources provide to scientific communities globally. The scale, openness, rich metadata and extensive curation of EMBL-EBI added-value databases make them particularly well suited as training sets for deep learning, machine learning and artificial intelligence applications, a selection of which are described here. The data resources at EMBL-EBI can catalyse such developments because they offer sustainable, high-quality data, collected in some cases over decades and made openly available to any researcher, globally. Our aim is for EMBL-EBI data resources to keep providing the foundations for tools and research insights that transform fields across the life sciences.


Subjects
Artificial Intelligence , Computational Biology , Data Management , Databases, Factual , Genome , Internet
9.
Nucleic Acids Res ; 51(D1): D418-D427, 2023 01 06.
Article in English | MEDLINE | ID: mdl-36350672

ABSTRACT

The InterPro database (https://www.ebi.ac.uk/interpro/) provides an integrative classification of protein sequences into families, and identifies functionally important domains and conserved sites. Here, we report recent developments with InterPro (version 90.0) and its associated software, including updates to data content and to the website. These developments extend and enrich the information provided by InterPro, and provide more user-friendly access to the data. Additionally, we have worked on adding Pfam website features to the InterPro website, as the Pfam website will be retired in late 2022. We also show that InterPro's sequence coverage has kept pace with the growth of UniProtKB. Moreover, we report the development of a card game as a method of engaging the non-scientific community. Finally, we discuss the benefits and challenges brought by the use of artificial intelligence for protein structure prediction.


Subjects
Databases, Protein , Humans , Amino Acid Sequence , Artificial Intelligence , Internet , Proteins/chemistry , Software
10.
Nat Comput Sci ; 3(6): 514-521, 2023 Jun.
Article in English | MEDLINE | ID: mdl-38177425

ABSTRACT

The carbon footprint of scientific computing is substantial, but environmentally sustainable computational science (ESCS) is a nascent field with many opportunities to thrive. To realize the immense green opportunities and continued, yet sustainable, growth of computer science, we must take a coordinated approach to our current challenges, including greater awareness and transparency, improved estimation and wider reporting of environmental impacts. Here, we present a snapshot of where ESCS stands today and introduce the GREENER set of principles, as well as guidance for best practices moving forward.

11.
Article in English | MEDLINE | ID: mdl-36572336

ABSTRACT

Biological databases serve as a global fundamental infrastructure for the worldwide scientific community, which dramatically aid the transformation of big data into knowledge discovery and drive significant innovations in a wide range of research fields. Given the rapid data production, biological databases continue to increase in size and importance. To build a catalog of worldwide biological databases, therefore, we curate a total of 5825 biological databases from 8931 publications, which are geographically distributed in 72 countries/regions and developed by 1975 institutions (as of September 20, 2022). We further devise a z-index, a novel index to characterize the scientific impact of a database, and rank all these biological databases as well as their hosting institutions and countries in terms of citation and z-index. Consequently, we present a series of statistics and trends of worldwide biological databases, yielding a global perspective to better understand their status and impact for life and health sciences. An up-to-date catalog of worldwide biological databases as well as their curated meta-information and derived statistics is publicly available at Database Commons (https://ngdc.cncb.ac.cn/databasecommons/).

12.
Elife ; 11, 2022 11 24.
Article in English | MEDLINE | ID: mdl-36421765

ABSTRACT

EROS (essential for reactive oxygen species) protein is indispensable for expression of gp91phox, the catalytic core of the phagocyte NADPH oxidase. EROS deficiency in humans is a novel cause of the severe immunodeficiency, chronic granulomatous disease, but its mechanism of action was unknown until now. We elucidate the role of EROS, showing it acts at the earliest stages of gp91phox maturation. It binds the immature 58 kDa gp91phox directly, preventing gp91phox degradation and allowing glycosylation via the oligosaccharyltransferase machinery and the incorporation of the heme prosthetic groups essential for catalysis. EROS also regulates the purine receptors P2X7 and P2X1 through direct interactions, and P2X7 is almost absent in EROS-deficient mouse and human primary cells. Accordingly, lack of murine EROS results in markedly abnormal P2X7 signalling, inflammasome activation, and T cell responses. The loss of both ROS and P2X7 signalling leads to resistance to influenza infection in mice. Our work identifies EROS as a highly selective chaperone for key proteins in innate and adaptive immunity and a rheostat for immunity to infection. It has profound implications for our understanding of immune physiology, ROS dysregulation, and possibly gene therapy.


Subjects
Granulomatous Disease, Chronic , NADPH Oxidases , Humans , Animals , Mice , NADPH Oxidases/metabolism , Reactive Oxygen Species/metabolism , Phagocytes/metabolism , Signal Transduction/physiology
13.
Nat Struct Mol Biol ; 29(11): 1056-1067, 2022 11.
Article in English | MEDLINE | ID: mdl-36344848

ABSTRACT

Most proteins fold into 3D structures that determine how they function and orchestrate the biological processes of the cell. Recent developments in computational methods for protein structure predictions have reached the accuracy of experimentally determined models. Although this has been independently verified, the implementation of these methods across structural-biology applications remains to be tested. Here, we evaluate the use of AlphaFold2 (AF2) predictions in the study of characteristic structural elements; the impact of missense variants; function and ligand binding site predictions; modeling of interactions; and modeling of experimental structural data. For 11 proteomes, an average of 25% additional residues can be confidently modeled when compared with homology modeling, identifying structural features rarely seen in the Protein Data Bank. AF2-based predictions of protein disorder and complexes surpass dedicated tools, and AF2 models can be used across diverse applications equally well compared with experimentally determined structures, when the confidence metrics are critically considered. In summary, we find that these advances are likely to have a transformative impact in structural biology and broader life-science research.


Subjects
Computational Biology , Furylfuramide , Computational Biology/methods , Binding Sites , Proteins/chemistry , Databases, Protein , Protein Conformation
14.
Bioinform Adv ; 2(1): vbac072, 2022.
Article in English | MEDLINE | ID: mdl-36408459

ABSTRACT

Motivation: The conventional methods to detect homologous protein pairs use the comparison of protein sequences. But the sequences of two homologous proteins may diverge significantly and consequently may be undetectable by standard approaches. The release of the AlphaFold 2.0 software enables the prediction of highly accurate protein structures and opens many opportunities to advance our understanding of protein functions, including the detection of homologous protein structure pairs. Results: In this proof-of-concept work, we search for the closest homologous protein pairs using the structure models of five model organisms from the AlphaFold database. We compare the results with homologous protein pairs detected by their sequence similarity and show that the structural matching approach finds a similar set of results. In addition, we detect potential novel homologs solely with the structural matching approach, which can help to understand the function of uncharacterized proteins and make previously overlooked connections between well-characterized proteins. We also observe limitations of our implementation of the structure-based approach, particularly when handling highly disordered proteins or short protein structures. Our work shows that high-accuracy protein structure models can be used to discover homologous protein pairs, and we expose areas for improvement of this structural matching approach. Availability and Implementation: Information on the discovered homologous protein pairs can be found at the following URL: https://doi.org/10.17863/CAM.87873. The code can be accessed here: https://github.com/VivianMonzon/Reciprocal_Best_Structure_Hits. Supplementary information: Supplementary data are available at Bioinformatics Advances online.

15.
PLoS Comput Biol ; 18(10): e1010610, 2022 10.
Article in English | MEDLINE | ID: mdl-36260616

ABSTRACT

Proteins that are known only at a sequence level outnumber those with an experimental characterization by orders of magnitude. Classifying protein regions (domains) into homologous families can generate testable functional hypotheses for yet unannotated sequences. Existing domain family resources typically use at least some degree of manual curation: they grow slowly over time and leave a large fraction of the protein sequence space unclassified. Here we describe automatic clustering by Density Peak Clustering of UniRef50 v. 2017_07, a protein sequence database including approximately 23M sequences. We performed a radical re-implementation of a pipeline we previously developed to handle millions of sequences and data volumes of the order of 3 terabytes. The modified pipeline, which we call DPCfam, finds ~45,000 protein clusters in UniRef50. Our automatic classification corresponds closely to those of the Pfam and ECOD resources: in particular, about 81% of medium-large Pfam families and 72% of ECOD families can be mapped to clusters generated by DPCfam. In addition, our protocol finds more than 14,000 clusters composed of protein regions with no Pfam annotation, which are therefore candidates for representing novel protein families. These results are made available to the scientific community through a dedicated repository.


Subjects
Proteins , Databases, Protein , Proteins/genetics , Cluster Analysis , Amino Acid Sequence , Protein Domains
16.
Database (Oxford) ; 2022, 2022 08 12.
Article in English | MEDLINE | ID: mdl-35961013

ABSTRACT

Over the last 25 years, biology has entered the genomic era and is becoming a science of 'big data'. Most interpretations of genomic analyses rely on accurate functional annotations of the proteins encoded by more than 500 000 genomes sequenced to date. By different estimates, only half the predicted sequenced proteins carry an accurate functional annotation, and this percentage varies drastically between different organismal lineages. Such a large gap in knowledge hampers all aspects of biological enterprise and, thereby, is standing in the way of genomic biology reaching its full potential. A brainstorming meeting to address this issue funded by the National Science Foundation was held during 3-4 February 2022. Bringing together data scientists, biocurators, computational biologists and experimentalists within the same venue allowed for a comprehensive assessment of the current state of functional annotations of protein families. Further, major issues that were obstructing the field were identified and discussed, which ultimately allowed for the proposal of solutions on how to move forward.


Subjects
Genomics , Proteins , Base Sequence , Computational Biology , Genome , Molecular Sequence Annotation
17.
Nature ; 609(7925): 144-150, 2022 09.
Article in English | MEDLINE | ID: mdl-35850148

ABSTRACT

Retrons are prokaryotic genetic retroelements encoding a reverse transcriptase that produces multi-copy single-stranded DNA (msDNA). Despite decades of research on the biosynthesis of msDNA, the function and physiological roles of retrons have remained unknown. Here we show that Retron-Sen2 of Salmonella enterica serovar Typhimurium encodes an accessory toxin protein, STM14_4640, which we renamed RcaT. RcaT is neutralized by the reverse transcriptase-msDNA antitoxin complex, and becomes active upon perturbation of msDNA biosynthesis. The reverse transcriptase is required for binding to RcaT, and the msDNA is required for the antitoxin activity. The highly prevalent RcaT-containing retron family constitutes a new type of tripartite DNA-containing toxin-antitoxin system. To understand the physiological roles of such toxin-antitoxin systems, we developed toxin activation-inhibition conjugation (TAC-TIC), a high-throughput reverse genetics approach that identifies the molecular triggers and blockers of toxin-antitoxin systems. By applying TAC-TIC to Retron-Sen2, we identified multiple trigger and blocker proteins of phage origin. We demonstrate that phage-related triggers directly modify the msDNA, thereby activating RcaT and inhibiting bacterial growth. By contrast, prophage proteins circumvent retrons by directly blocking RcaT. Consistently, retron toxin-antitoxin systems act as abortive infection anti-phage defence systems, in line with recent reports. Thus, RcaT retrons are tripartite DNA-regulated toxin-antitoxin systems, which use the reverse transcriptase-msDNA complex both as an antitoxin and as a sensor of phage protein activities.


Subjects
Antitoxins , Bacteriophages , Retroelements , Salmonella typhimurium , Toxin-Antitoxin Systems , Antitoxins/genetics , Bacteriophages/metabolism , DNA, Bacterial/genetics , DNA, Single-Stranded/genetics , Nucleic Acid Conformation , Prophages/metabolism , RNA-Directed DNA Polymerase/metabolism , Retroelements/genetics , Salmonella typhimurium/genetics , Salmonella typhimurium/growth & development , Salmonella typhimurium/virology , Toxin-Antitoxin Systems/genetics
18.
J Bacteriol ; 204(6): e0010722, 2022 06 21.
Article in English | MEDLINE | ID: mdl-35608365

ABSTRACT

Fibrillar adhesins are bacterial cell surface proteins that mediate interactions with the environment, including host cells during colonization or other bacteria during biofilm formation. These proteins are characterized by a stalk that projects the adhesive domain closer to the binding target. Fibrillar adhesins evolve quickly and thus can be difficult to computationally identify, yet they represent an important component for understanding bacterium-host interactions. To detect novel fibrillar adhesins, we developed a random forest prediction approach based on common characteristics we identified for this protein class. We applied this approach to Firmicutes and Actinobacteria proteomes, yielding over 6,500 confidently predicted fibrillar adhesins. To verify the approach, we investigated predicted fibrillar adhesins that lacked a known adhesive domain. Based on these proteins, we identified 24 sequence clusters representing potential novel members of adhesive domain families. We used AlphaFold to verify that 15 clusters showed structural similarity to known adhesive domains, such as the TED domain. Overall, our study has made a significant contribution to the number of known fibrillar adhesins and has enabled us to identify novel members of adhesive domain families involved in bacterial pathogenesis. IMPORTANCE Fibrillar adhesins are a class of bacterial cell surface proteins that enable bacteria to interact with their environment. We developed a machine learning approach to identify fibrillar adhesins and applied this classification approach to the Firmicutes and Actinobacteria Reference Proteomes database. This method allowed us to detect a high number of novel fibrillar adhesins and also novel members of adhesive domain families. To confirm our predictions of these potential adhesin protein domains, we predicted their structure using the AlphaFold tool.


Subjects
Adhesives , Proteome , Adhesins, Bacterial/metabolism , Bacteria/genetics , Bacteria/metabolism , Bacterial Adhesion , Humans , Membrane Proteins/chemistry , Protein Domains
19.
Nat Biotechnol ; 40(6): 932-937, 2022 06.
Article in English | MEDLINE | ID: mdl-35190689

ABSTRACT

Understanding the relationship between amino acid sequence and protein function is a long-standing challenge with far-reaching scientific and translational implications. State-of-the-art alignment-based techniques cannot predict function for one-third of microbial protein sequences, hampering our ability to exploit data from diverse organisms. Here, we train deep learning models to accurately predict functional annotations for unaligned amino acid sequences across rigorous benchmark assessments built from the 17,929 families of the protein families database Pfam. The models infer known patterns of evolutionary substitutions and learn representations that accurately cluster sequences from unseen families. Combining deep models with existing methods significantly improves remote homology detection, suggesting that the deep models learn complementary information. This approach extends the coverage of Pfam by >9.5%, exceeding additions made over the last decade, and predicts function for 360 human reference proteome proteins with no previous Pfam annotation. These results suggest that deep learning models will be a core component of future protein annotation tools.


Subjects
Deep Learning , Amino Acid Sequence , Databases, Protein , Humans , Molecular Sequence Annotation , Proteome/metabolism , Proteomics
20.
Bioinform Adv ; 2(1): vbab043, 2022.
Article in English | MEDLINE | ID: mdl-36699409

ABSTRACT

Motivation: The release of AlphaFold 2.0 has revolutionized our ability to determine protein structures from sequences. This tool also inadvertently opens up many unanticipated opportunities. In this article, we investigate the AntiFam resource, which contains 250 protein sequence families that we believe to be spurious protein translations. We would not expect proteins belonging to these families to fold into well-ordered globular structures. To test this hypothesis, we have attempted to computationally determine the structure of a representative sequence from all AntiFam 6.0 families. Results: Although the large majority of families showed no evidence of globular structure, we have identified one example for which a globular structure is predicted. Proteins in this AntiFam entry indeed seem likely to be bona fide proteins, based on additional considerations, and thus AlphaFold provides a useful quality control for the AntiFam database. Conversely, known spurious proteins offer a useful set of quality controls for AlphaFold. We have identified a trend that the mean structure prediction confidence score pLDDT is higher for shorter sequences. Of the 131 AntiFam representative sequences <100 amino acids in length, AlphaFold predicts a mean pLDDT of 80 or greater for six of them. Thus, particular care should be taken when applying AlphaFold to short protein sequences. Availability and implementation: The AlphaFold predictions for representative sequences can be found at the following URL: https://drive.google.com/drive/folders/1u9OocRIAabGQn56GljoG1JTDAxjkY1ro. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
